\section{Introduction}

The Toolkit for Advanced Optimization (TAO)
focuses on the design and implementation of
component-based optimization software for the
solution of large-scale optimization applications
on high-performance architectures.
Our approach is motivated by the
scattered support for parallel computations and
lack of reuse of linear algebra software in
currently available optimization software.
%  We exploit numerical abstractions in the optimization software
%  design so that we can leverage external parallel computing
%  infrastructure (for example, communication libraries and visualization
%  packages) and linear algebra tools in the
%  development of optimization algorithms. The algorithms in the toolkit
%  place strong emphasis on the reuse of external tools where appropriate.
Our design enables connection to lower-level
support (parallel sparse matrix data
structures, preconditioners, solvers) provided in toolkits such as
PETSc \cite{petsc,PETSc-user-ref},
and thus we are able to build on top of these toolkits
instead of having to redevelop code. The advantages in
terms of development time are significant.

%  Initial work in the TAO project \cite{tao-user-ref}
%  has centered on the development of a
%  core library of components for various types of optimization problems,
%  including unconstrained and bound-constrained minimization and
%  nonlinear least squares.  
A major concern in the TAO project \cite{tao-user-ref}
is the performance and scalability
of optimization algorithms on large problems.
In this case study
%  To explain the TAO design strategy and
%  analyze parallel performance issues, 
we focus on the GPCG
(gradient projection, conjugate gradient)
algorithm \cite{more-toraldo} 
for the solution
of the bound-constrained quadratic programming problem
\begin{equation} \label{def_bqp}
\min \{ q(x) : l \leq x \leq u \} ,
\end{equation}
where $q : \R^n \mapsto \R $ is a strictly convex quadratic function,
and the vectors $l$ and $u$ define bounds on the variables.
Although GPCG had been originally designed 
for large-scale problems, implementation of GPCG on a 
parallel architecture presented significant obstacles that
are typical of a large class of optimization algorithms.
The most significant obstacle arises from the 
method used to compute the step between iterates.
Specifically, in modern active set methods for solving
\Ref{def_bqp}, the step between iterates is usually defined
via the approximate solution of a linear system of the form
\[
A_k w_k = - r_k ,
\]
where the matrix $ A_k $ and the vector $r_k$ are, respectively,
the reduced Hessian matrix and the reduced gradient
of $q$ with respect to the free variables.
In a parallel environment, the efficient implementation of the conjugate
gradient method requires that $A_k$ 
be evenly distributed over the processors,
but since the set of free variables can change
drastically between iterates, the reduced matrix is
unlikely to be well distributed. Hence,
a redistribution of the rows of $A_k$ over the processors
may be necessary at each iteration.

This observation implies that the
scalability of the GPCG algorithm is limited not only by the efficiency
of the redistribution algorithm but also by the sizes of
the matrices $A_k$. If the set of free variables is large,
then performance is likely to improve because the
communication overhead is relatively small, while performance is
likely to suffer when there are few free variables.
The performance and scalability of the GPCG algorithm also depend
on the preconditioner used by the CG algorithm.
While the goal of preconditioning is to reduce both the number of
floating point operations and the overall computing time for solving
a problem, it is possible that the solution time may increase
for certain preconditioners.

In this paper we study the performance and scalability
of the GPCG algorithm as a function of the 
number of variables, the number of free variables, and the preconditioner.
These issues are relevant to general optimization algorithms and
to the development of scalable optimization algorithms, and thus
the GPCG algorithm is a prime candidate for a case study
in the performance and scalability of optimization algorithms
in parallel architectures.

Our implementation of GPCG uses object-oriented techniques
to leverage the parallel
computing and linear algebra infrastructure offered by PETSc
\cite{petsc,PETSc-user-ref}, which relies on the Message-Passing 
Interface (MPI) \cite{using-mpi} standard for all
interprocessor communication.  
As a result, our implementation runs on a wide variety of
high-performance architectures.
Biros and Ghattas \cite{GB99b,GB99a} have been using
a similar approach for the solution of PDE-constrained optimization problems.
They have also been concerned with efficiency and scalability
issues, but for quadratic problems with linear equality constraints.
As we have pointed out, inequality constrained optimization
problems give rise to different performance issues.
Hohmann \cite{hohmann:94}, 
Deng, Gouveia and Scales \cite{HLD94},
Meza \cite{meza:94},
Bruhwiler et al. \cite{bsca98}, and Gockenbach, Petro, and Symes
\cite{Gockenbach:1999:CCL}
have employed object-oriented design
for nonlinear optimization, but their work
does not address the reuse of linear algebra toolkits and
is restricted to uniprocessor environments.
Our use of object-oriented techniques and linear algebra
toolkits also distinguishes our implementation of GPCG from
the data-parallel implementation of McKenna, Mesirov, and
Zenios \cite{MOM95}. In particular, they can rely only
on diagonal preconditioners, while our approach allows a wide
range of preconditioners.

Sections \ref{qp} and \ref{alg} are dedicated to
background material on the bound-constrained 
optimization problem \Ref{def_bqp} and to
a brief overview of the GPCG algorithm, while
Section \ref{design} has a discussion
of our design philosophy and its benefits in
developing robust and scalable solutions strategies.

The performance results
in Section \ref{sec:performance} are noteworthy in several ways.
First, the number of faces visited by GPCG is remarkably small.
Other strategies can lead to a large number of gradient projection
iterates, but the GPCG algorithm is remarkably efficient.
Another interesting aspect is that because of the
low memory requirements of iterative solvers, we are able
to solve problems with over 2.5 million variables
with only $ 8 $ processors.
Strategies that rely on direct solvers are likely to need
significantly more storage, and thus more processors.

Section \ref{sec:analysis} examines the scalability of the
GPCG component functions and the performance of GPCG
as the number of variables and the number of active
variables at the solution change. These results illustrate the
complex performance behavior for constrained optimization
problems as well as the observation that
performance results that focus only on efficiency can be
deceiving if the total computing time is not taken into account.

Section \ref{sec:preconditioners} considers the performance
of GPCG as the preconditioners change. The
ability to use various preconditioners is
a result of our design, which allows the
connection to external linear algebra toolkits.
Our results in this section show that for our benchmark
problem, a block Jacobi preconditioner with
one block per processor, where each subproblem is solved
with a standard, sparse ILU(2) factorization, is faster than
a variant with ILU(0). We also show that both block Jacobi variants are faster
than a simple point Jacobi method,
although the point Jacobi preconditioner exhibits
better scalability.


